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Abstract 

The genomes of related species contain valuable information on the history of the considered taxa. Great apes in 
particular exhibit variation of evolutionary patterns along their genomes. However, the great ape data also bring new 
challenges, such as the presence of incomplete lineage sorting and ancestral shared polymorphisms. Previous methods for 
genome'Scale analysis are restricted to very few individuals or cannot disentangle the contribution of mutation rates and 
fixation biases. This represents a limitation both for the understanding of these forces as well as for the detection of 
regions affected by selection. Here, we present a new model designed to estimate mutation rates and fixation biases from 
genetic variation within and between species. We relax the assumption of instantaneous substitutions, modeling sub' 
stitutions as mutational events followed by a gradual fixation. Hence, we straightforwardly account for shared ancestral 
polymorphisms and incomplete lineage sorting. We analyze genome-wide synonymous site alignments of human, chim- 
panzee, and two orangutan species. From each taxon, we include data from several individuals. We estimate mutation 
rates and GC'biased gene conversion intensity. We find that both mutation rates and biased gene conversion vary with 
GC content. We also find lineage'Specific differences, with weaker fixation biases in orangutan species, su^esting a 
reduced historical effective population size. Finally, our results are consistent with directional selection acting on coding 
sequences in relation to exonic splicing enhancers. 

Key words: phylogenetics-population genetics model, mutation rates, biased gene conversion, rate heterogeneity, coding 
sequence evolution, primates evolution. 



Introduction 

The increased availability of sequenced genomes both from 
closely related species and from individuals of the same spe- 
cies, offers a great opportunity to study the speciation and 
evolutionary history of populations at different timescales, 
provided we can properly model the process of sequence 
evolution using inter- and intraspecific data together. The 
role of mutation and selection are of particular interest in 
this context. Mutation introduces genetic diversity, the raw 
material of evolution. Natural selection, along with neutral 
fixation biases and random genetic drift, can cause alleles 
newly introduced by mutations to increase in frequency 
and reach fixation. For comparative analysis that aim to 
detect selection and identify functional elements, disentan- 
gling the contribution of these forces is important. 

In the past, phylogenetic methods focused on interspecies 
data, whereas population genetics was mainly concerned 
with intraspecies patterns. Classical population genetics 
methods can test the presence of selection, but do not in- 
clude divergence data from multiple species (except as out- 
groups, see e.g., McDonald and Kreitman 1991; Schneider 
et al. 2011). Standard phylogenetic models instead infer sub- 
stitution rates but not mutation rates and fixation biases. 
There are a few exceptions, for example, the mutation-selec- 
tion codon model of Yang and Nielsen (2008). This model 



assumes the same nucleotide mutational process for all 
codon positions, and estimates a fitness parameter for each 
codon, allowing to test the presence of selection on codon 
usage from interspecies data. 

Some methods use both population genetics and phylo- 
genetics models. For example, it is possible to estimate 
phylogenetic trees by reconstructing the genealogies of indi- 
viduals from different species using the multi-species coales- 
cent. Liu (2008) assumes no recombination within genes and 
free recombination among genes. RoyChoudhury et al. (2008) 
assume no new mutations, so that all the divergence among 
taxa originates from change in allele frequencies in standing 
variation. This method has been recently generalized to allow 
new mutations along the population tree, but is still limited 
to bi-allelic sites (Bryant et al. 2012). All these coalescent- 
based methods assume neutrality. 

Wilson et al. (2011) proposed a combined phylogenetic- 
population genetics approach that analyzes population 
data from different species and estimates a distribution of 
selective coefficients in coding regions. Recently, Cronau 
et al. (2013) developed a model similar to that of Wilson 
et al. (2011) and applied it to noncoding sequence data. 
Both the latter methods assume a standard substitution 
model along the phylogeny relating the species, and require 
all polymorphisms to be recent. In fact, the population ge- 
netics model allowing for intraspecific differences is only used 
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at the tips of the tree, and not in inner phylogenetic branches 
and nodes. 

Here, we introduce a POIymorphism-aware phylogenetic 
MOdel (PoMo), that, similarly to the model of Wilson et al. 
(2011), uses both polymorphism and divergence data simul- 
taneously. However, for PoMo, we do not assume that poly- 
morphisms originate from recent mutations. In our 
phylogenetic continuous-time Markov chain, polymorphisms 
are present both at terminal and ancestral nodes of the spe- 
cies tree. In this way, we can naturally account for ancestral 
shared polymorphisms (Clark 1997) and incomplete lineage 
sorting (Maddison and Knowles 2006; Pollard et al. 2006). 
Furthermore, by not assuming stationarity, reversibility, 
context-independence or mutational strand-symmetry, our 
model can describe complex mutational scenarios (Hwang 
and Green 2004; Polak and Arndt 2008). We show using 
simulations that, with our new model, we can accurately 
infer relative mutation rates and fixation biases, and that 
the inferences are robust to changes in demography. 

One of the most intriguing aspects of the human genome 
is its exceptional heterogeneity. Base substitution rates differ 
among nucleotides, nucleotide contexts, genomic regions, 
and chromosomes (for a review see Hodgkinson and Eyre- 
Walker 2011). Knowing the intensity and variability of both 
mutation and fixation biases, whether due to selection or 
other forces such as biased gene conversion, is fundamental 
for interpreting evolutionary patterns (Ratnakumar et al. 
2010). For example, coding sequence is a major determinant 
of fitness and adaptation (Eyre-Walker and Keightley 2007), 
but undergoes peculiar evolutionary forces, with transcribed 
sequences evolving differently from the rest of the genome, 
showing, for example, strand-specific substitution rates 
(Hwang and Green 2004). It is therefore appealing to use 
synonymous sites as a neutral reference for coding sequence 
evolution, although selection can affect evolution of synony- 
mous sites involved in the splicing process (Chamary et al. 
2006; Parmley and Hurst 2007). Furthermore, mutation and 
fixation biases can have severe consequences on the fitness of 
individuals and populations (Galtier et al. 2009; Hodgkinson 
and Eyre-Walker 2011). 

We performed a comprehensive study of evolutionary pat- 
terns of synonymous sites in great apes (humans [Homo sa- 
piens], chimpanzees [Pan troglodytes], and orangutans [Pongo 
abelii and Pon. pygmaeus]). By using PoMo on polymorphism 
and divergence data simultaneously, we were able to over- 
come the limitations of previous studies, in particular disen- 
tangling the contributions of mutation and fixation biases to 
the evolution of synonymous sites. We first estimate global 
patterns of coding sequence evolution in great apes genome- 
wide, including a comparison of lineage-specific trends. Then, 
we show evidence in favor of variation in mutation and fix- 
ation rates between genomic regions with different base com- 
position, contributing to the long-standing debate regarding 
the origin and maintenance of GC-content variation (Eyre- 
Walker and Hurst 2001). Finally, we consider variation in evo- 
lutionary patterns within exons, examining evidence suggest- 
ing recent directional selection on synonymous sites. 



New Approaches 

We developed a new approach PoMo that uses polymor- 
phism and divergence data simultaneously, and can estimate 
relative mutation rates, disentangling them from fixation 
biases in their contribution to substitution patterns. Similar 
to classical phylogenetic approaches (reviewed in Whelan 
et al. 2001), our model is a continuous-time Markov chain. 
We assume that a phylogenetic tree relates the species con- 
sidered, and that nucleotide sequences evolve along it. 
Phylogenetic methods usually include only a reference 
genome for each species considered, and ignore intraspecies 
diversity. Here, in contrast, we use data from multiple within- 
population individuals to infer allele frequencies for each 
taxon (fig. 1A and see Materials and Methods). Similar to 
classical phylogenetic approaches, we model all sites as inde- 
pendent and discard haplotype structure. This way, we ignore 
information regarding recombination events, but we also 
bypass the problem of accounting for all possible coalescent 
trees, which is of elevated computational complexity when 
considering large samples (Dutheil et al. 2009). 

We include polymorphisms as states of the Markov chain, 
in addition to the four nucleotide states of classical nucleotide 
models. In fact, in our Markov chain, two nucleotides can be 
present simultaneously at one site for one species/population. 
If a polymorphism is present at a tip, it means that the cor- 
responding species has an observed polymorphism at the 
corresponding site and at the corresponding allele frequency 
(fig. IB and see Materials and Methods). 

Although classical models assume instantaneous substitu- 
tions, we separate the mutation and fixation processes. In 
fact, we model sequence evolution as a gradual process 
made by small allele frequency changes (fig. 1C and see 
Materials and Methods). As in classical phylogenetic 
models, the states in inner nodes/branches are usually un- 
known. This uncertainty is accounted for by considering the 
probability of each possible combinations of ancestral states 
via the Felsenstein pruning algorithm (Felsenstein 1981). In 
PoMo, we add the possibility of polymorphisms at various 
allele frequencies at inner nodes and branches (fig. ID and see 
Materials and Methods). This means that we account for 
ancestral polymorphisms and in particular for ancestral 
shared polymorphisms (Charlesworth et al. 2005) and incom- 
plete lineage sorting (when two speciation events are sepa- 
rated by a lapse of time not sufficient for polymorphisms to 
reach fixation, see Maddison and Knowles 2006). The param- 
eters in PoMo do not merely describe substitution rate, but 
are also informative of mutation rates, fixation biases, root 
nucleotide frequencies and branch lengths. All these param- 
eters are estimated by maximum likelihood (ML) (fig. IE and 
see Materials and Methods). 

Although many genomes (including the human genome) 
are not in base composition equilibrium, most phylogenetic 
models assume equilibrium and reversibility for convenience 
(Squartini and Arndt 2008). Here, we do not assume equilib- 
rium or reversibility. Furthermore, because mutations in 
human coding sequences are thought to be strand-asymmet- 
ric and context-dependent (Hwang and Green 2004; Polak 
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Fig. 1. Parameter estimation with PoMo. (A) Data from synonymous sites of each of the four species considered are collected. For each species, 10 
alleles are sampled (the figure depicts data from a single site). (B) Each site of each species is associated with a state in PoMolO according to its allele 
counts. (C-D) Given a set of parameter values, the likelihood of each site is calculated. (C) Transition probabilities between nodes are calculated 
according to the PoMolO rate matrix. For simplification, the figure shows only two alleles, while the full model has four alleles (supplementary table S9, 
Supplementary Material online). (D) Following the Felsenstein pruning algorithm (Felsenstein 1981), we sum probabilities over all combinations of states 
at inner nodes. (£) The likelihood of all sites is combined, and the process is iterated with different parameter values until we find those that maximize 
the likelihood. These values (mutation rates, fixation biases, and root nucleotide frequencies) are our final estimates. 



and Arndt 2008), we explicitly account for, and measure, both 
phenomena (see Materials and Methods). 

Results 

Simulations 

To assess the precision of our methods in parameter estima- 
tion, we performed forward population genetics simulations 
with simuPOP (Peng and Kimmel 2005) on a phylogenetic 
tree. Our simulations closely mimic the features (divergence 



and diversity) of the great ape data set (see Materials and 
Methods). We reliably inferred the simulated parameter 
values when more than 10^ sites were provided (fig. 2 and 
supplementary figs. S10-S12, Supplementary Material 
online), far fewer than in the real data set 2x10^, see 
Materials and Methods). We observed errors at levels of 
^ 5% or below for branch lengths and ancestral and equi- 
librium nucleotide frequencies, and at most 10% for relative 
mutation rates. The intensity of selection was slightly 
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Fig. 2. Performance for simulated data. Mutational and frequency parameters simulated were as estimated in the highest GC-content bin (see Materials 
and Methods). Intensity of selection for GC versus AT was set to 4NeS = 1. On X axis is the number of sites in the data set used. Each box plot 
represents 10 simulations. The errors in the estimation, on the Y axis, were calculated as the Euclidean distance between the vector of estimated 
parameters and the true values, normalized by the Euclidean norm of the true vector. (A) Error in estimation of fixation biases (6 entries vector, one for 
each substitution type, blue box plot), non-CpG mutation rates (6 entries, red), and CpG hypermutability (single-entry vector, orange). (B) Error in 
estimation of branch lengths (green), ancestral nucleotide frequencies (pink), and equilibrium nucleotide frequencies (yellow). (C) Estimates of. GC 
versus AT fixation bias (blue), GC* in sites not preceded by C and not followed by G (yellow), GC* in sites not preceded by C and followed by G (green), 
and GC* in sites preceded by C and followed by G (red). The horizontal dashed lines represent the respective true values used for the simulations. 




underestimated, and required more data for acceptable infer- 
ence (fig. 2 and supplementary fig. SI 2, Supplementary 
Material online). This bias is probably due to the small 
number of polymorphic states used and to the fact that we 
ignore sampling variance. 

We measured the running time of our method on simu- 
lated data (supplementary table S8, Supplementary Material 
online). ML estimations required less than 30 min on a single 



processor (2.66 GHz 6-Core Intel Xeon) for both the basic 
PoMolO and the asy-CpG-PoMolOb models for a genome 
scale data set (up to 500 kB). This suggests that PoMo could 
be applied to genome-wide data from dozens of species 
simultaneously, that the state space could be expanded to 
include a larger virtual population (table 1), or to incorporate 
more model parameters to describe complex evolutionary 
scenarios. 
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Table 1. Computing Times Required with Increasing N. 





PoMolO 


PoMo20 


PoMo30 


PoMo40 


Number of states 


58 


118 


178 


238 


CPU time (s) 


112 


513 


3,465 


4,946 



Note. — Running times for ML estimation of model parameters. Values shown rep- 
resent numbers of seconds for a data set with 10^ sites and simulated with high GC 
content, selection for GC and context-dependency. Estimations were performed with 
the standard multi-threaded version of HyPhy (HYPHYMP) on a Mac OS X with 
2.66 GHz 6-Core Intel Xeon processors. 

We also performed simulations including up to four dif- 
ferent demographic events (bottleneck, expansion, migration, 
and reduction) in the same phylogeny (see Materials and 
Methods). Our approach proved robust in these cases (sup- 
plementary figs. S13-S15, Supplementary Material online). 
Demographic events usually bias estimation of selection in 
population genetics methods that use only polymorphism 
data (Haddrill et al. 2005; Keightley and Eyre-Walker 2007; 
Zeng and Charlesworth 2009). In contrast, we not only con- 
sider the site frequency spectrum, but also divergence 
patterns. 

Analysis of Great Apes Whole-Exome Data 
We extracted synonymous sites from coding sequence align- 
ments of different species (human, chimp, and orangutan) 
and different individuals within species (see Materials and 
Methods). From these sites, using PoMo, we inferred relative 
mutation rates, fixation biases, and nucleotide frequencies at 
equilibrium and at the root. We first estimated global pat- 
terns from whole-exome data, then focused on variation be- 
tween lineages, regions with different GC content, and sites in 
different positions within exons. 

Global Estimates of Mutation Rates 

Using different variants of PoMo, we estimated relative mu- 
tation rates from the global data set. Unsurprisingly, transi- 
tions (mutations between A and C and between C and T) had 
higher rates than transversions (the remaining mutations), in 
particular when CpG context was not accounted for (fig. 3A). 
We compared our results with those of Lynch (2010), who 
measured the rate of new deleterious nonsense and missense 
mutations in humans. We find notable differences; in partic- 
ular. Lynch (2010) estimated lower transition rates relative to 
transversions (fig. 3A). One explanation for this is that mis- 
sense and nonsense mutations are enriched in transversions 
(see Discussion, supplementary information, and fig. SI, 
Supplementary Material online). 

Regions near the transcription start site undergo peculiar 
substitutional patterns, with, in particular, reduced CpG con- 
text effects (Polak and Arndt 2008). Consistent with these 
observations, our hypermutability estimates greatly differ be- 
tween first exons and other exons (fig. 3B). After removing 
first exons, our relative mutation rate estimates are very sim- 
ilar to the phylogenetic estimates of (Duret and Arndt 2008; 
fig. 3B), despite the fact that they did not account for fixation 
biases and did not restrict their analysis to coding sequences. 

Previous studies have suggested the presence of strand- 
asymmetric substitution rates in human transcribed 



sequences (Hwang and Green 2004; Polak and Arndt 2008), 
and different asymmetries in different regions of the tran- 
script (Polak and Arndt 2008). For this reason, we included 
strand-specific mutation and fixation biases in PoMo (see 
Materials and Methods, model asy-CpG-PoMolOb). Our anal- 
yses with this model support the idea that substitutional 
asymmetries are due to mutation rates, and not fixation 
biases (fig. 3C). Furthermore, we detected the same muta- 
tional asymmetries as predicted by Polak and Arndt (2008), 
and, again as expected based on the latter study, first exons 
show different asymmetries from the other exons (supple- 
mentary fig. S3, Supplementary Material online). 

Global Estimates of GC-Biased Gene Conversion and 
Base Composition 

Although fixation biases can be caused by directional selec- 
tion, the genome-wide fixation bias favoring GC over AT al- 
leles in mammals is generally attributed to GC-biased gene 
conversion (gBGC; Duret and Galtier 2009). The effect of 
gBGC is similar to selection (Nagylaki 1983), and therefore 
the intensity of gBGC is usually expressed in terms of AN^s. 
After accounting for context dependencies and mutational 
asymmetries (see Materials and Methods, model asy-CpG- 
PoMolOb), our estimate of gBGC is 4NeS = 0.62, and is dif- 
ferent from 0 according to both the Akaike Information 
Criterion (AlC) and the Bayesian Information Criterion 
(BIG). On the other hand, accounting for differences in fitness 
between G and C, and between A and T does not improve the 
fit of the model according to AlC or BIC. We note, however, 
that the inferred fixation biases depend on the mutation 
model (supplementary fig. S4, Supplementary Material online; 
Hernandez et al. 2007). 

Estimates of root, present and equilibrium nucleotide fre- 
quencies show that base composition is not at equilibrium in 
great apes, with GC content decreasing over time (supple- 
mentary fig. S5, Supplementary Material online). 

Lineage-Specific Estimates 

Global patterns presented earlier should be interpreted as 
averages along the phylogenetic tree considered. However, 
substitution patterns vary considerably. For example, Polak 
et al. (2010) showed that the equilibrium GC content (GC*) 
in orangutan genes is lower than that in human and chim- 
panzee. GC* is determined by the GC/AT bias in substitution 
rates, and differences in substitution biases can be caused by 
changes in mutation rates or fixation biases. We investigated 
whether orangutan differs in mutation rates and/or fixation 
biases from human and chimpanzee. We allowed parameters 
to differ between the human-chimp lineage (comprising the 
human and chimpanzee branches, and the branch of their 
ancestor) and the orangutan lineage (comprising the two 
orangutan branches, and the branch of their ancestor). 
Allowing lineage-specific fixation biases and mutation rates 
resulted in improvements in both AlC (25.31) and BIC (12.83) 
scores (see table 2 for details). Our estimate of gBGC intensity 
in human-chimp was almost double that in orangutan 
(rs 0.7 vs. 0.35; table 2), even when we allowed different 
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mutation rates in different lineages. We conclude that a lower 
gBGC, possibly due to reduced effective population size, con- 
tributed to the lower GC* estimated in orangutan. 

Variation among Exons 

One well-studied aspect of genomic variation in mammals is 
GC content (Bernardi 2000; Eyre-Walker and Hurst 2001). 
Previously, either variation in mutation biases (reviewed in 
Duret 2009; Hodgkinson and Eyre- Walker 201 1 ), gBGC (Duret 
and Galtier 2009), or selective pressure (Bernardi 2000), have 



been suggested to cause variation in GC content in mammals, 
but until now, no analytical framework was available to infer 
the relative importance of these processes. Our new approach 
represents a great opportunity for disentangling, and quanti- 
fying, mutational, and fixation biases variation. 

We binned exons according to their GC content at syn- 
onymous sites (GC4) from lowest to highest, so that all bins 
have roughly the same number of sites 3.25 x 10^). GC4 
strongly correlates with regional GC content (Clay et al. 1996; 
Duret and Hurst 2001; Eyre- Walker and Hurst 2001). We then 
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Table 2. Lineage-Specific Models. 

Model Number of Parameters AlC Score BIC Score gBGC in Human-Chimp gBGC in Orangutan 

No lineage specificity (null)' 37 — — 0.62 0.62 

2 gBCC'' 38 -25.31 -12.83 0.72 0.35 

2 gBGC, 2 h /<CT, 2 h /(^ 40 -23.69 13.76 0.72 0.35 

2 gBGC, Iftf^ 56 - 321.12 - 83.94 0.69 0.39 

Note. — Comparison of models allowing for variation between the hominid lineage (human, chimp, and the branch from their ancestor to the root) and the orangutan lineage 

(Bornean and Sumatran orangutan and the branch from their ancestor to the root). gBGC is measured as the scaled fitness difference 2NeS between CC and AT alleles (we set 

Sa = St and Sc = Sc). 

"The Null model asy-CpG-PoMolOb. 

''Different gBGC in the two lineages. 

'^Different gBGC and CpG hypermutability (h/icj and h/icA) for the two lineages. 
'^Different gBGC and mutation rates (for every mutation type) in the two lineages. 

We show AlC and BIG differences with respect to the Null model. The best BIC and AlC scores are underlined. 



estimated mutation rates and fixation biases for each bin 
separately. Although most mutation rates vary only slightly, 
CpG hypermutability shows very large differences, being 
strongest in GC-poor exons and weakest in GC-rich exons 
(fig. 4A). Among the other mutation rates, most noticeably, 
/LiAc increases with GC content. These results are robust to 
the number of bins used, and to the exclusion of short exons 
(supplementary figs. S6 and S7, Supplementary Material 
online). 

gBGC also increases with GC content, ranging from 0.2 
to «s 1.2 (fig. 4B). Even after removing the potential biases 
coming from the first and second exons of each gene (fig. 3B), 
mutation rates and gBGC still vary between the extreme GC 
bins according to both AlC and BIC scores (table 3; supple- 
mentary table S2, Supplementary Material online). Although 
we accounted for many mutational biases, modeling site var- 
iation in total mutation rates still resulted in a model im- 
provement according to AlC and BIC (supplementary 
table S7, Supplementary Material online). This suggests that 
more context-dependent or cryptic factors in mutation rate 
variation exist (Hodgkinson et al. 2009). 

It has been inferred that GC content of GC-rich regions is 
decreasing in mammals (Duret et al. 2002; Belle et al. 2004; Gu 
and Li 2006). Alvarez-Valin et al. (2004) claimed that this 
result can be explained with a bias in the method of inference 
(parsimony), and the presence of context-dependent muta- 
tions, regional variation, indels, and alignment errors. Here, we 
account for these problems by using an ML context-depen- 
dent model, and by analyzing synonymous sites of closely 
related species (which are expected to contain negligibly 
few alignment errors and indels). Although we observe that 
GC* is highest in GC-rich exons, the difference in GC* among 
bins is considerably smaller than the difference in present or 
root GC content, meaning that base composition is becoming 
homogenous across the genome (fig. 5). Furthermore, except 
for the GC-poorest bin, GC* is always lower than present and 
root GC content. Number of bins used and exclusion of short 
exons did not affect these results (supplementary fig. S8, 
Supplementary Material online). 

Variation within Exons 

Finally, we addressed the issue of variation in evolutionary 
patterns between exonic positions. Different regions within 



an exon can vary in substitutional and compositional trends. 
5'- and 3' -ends of nonterminal exons are GC-poorer than 
exon centers, and have stronger codon usage bias (Willie 
and Majewski 2004). This has been interpreted as the effect 
of selection on splicing motifs (reviewed in Chamary et al. 

2006) . In agreement with this hypothesis, codon usage bias in 
exon boundaries fits splicing motifs more than in exon cen- 
ters (although not for every amino acid, Parmley and Hurst 

2007) . We analyzed variation in mutation and fixation biases 
within exons. We excluded terminal exons and divided sites in 
3 bins. The 5'-bin contained the first 5 synonymous sites in 
each exon; the 3'-bin contained the last 5 sites; the central bin 
contained all the remaining sites. On each bin, we then esti- 
mated mutation rates and fixation biases as before. 

We observe a higher GC content in exon centers, but ex- 
tremely similar equilibrium frequencies in all bins (fig. 6A). 
Mutation rates do not vary noticeably (supplementary fig. 
S9B, Supplementary Material online). However, there are dif- 
ferences in fixation biases, with boundary sites showing pref- 
erence for A over T, and the opposite pattern in exon centers 
(fig. 6B). Models that allow for these differences are preferable 
according to AlC, but not BIC (table 4; supplementary table 
S3, Supplementary Material online). Nevertheless, estimated 
differences are larger than expected just by error according to 
simulations (cf supplementary figs. S9C and D, Supplemen- 
tary Material online). 

Discussion 

Understanding intensity and variation of mutation and 
fixation biases is fundamental for the interpretation of evo- 
lutionary patterns. With our new model, PoMo, and with 
genome-scale data of within and between-species diversity, 
we disentangled and estimated mutational and fixation biases 
at synonymous sites of great apes. 

Our estimates of mutation rates considerably differ from 
those of Lynch (2010), the only previous study, to our knowl- 
edge, that inferred relative mutation rates in humans while 
accounting for fixation biases (fig. 3A). Part of the difference 
might be due to the timescale considered (recent mutations 
in Lynch 2010, polymorphism and divergence here), and to 
the fact that we also include data from other great apes. 
However, we think that most of the discrepancy derives 
from transversions being over-represented in the missense 
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Fig. 4. Variation in fixation biases and mutation rates witii base composition. Exon alignments were binned in 5 classes according to GC content. On 
Y axis, we show parameter estimates for each bin, on X axis are bins ordered by increasing GC content. Error bars show the profile likelihood 95% 
confidence intervals. If not visible, confidence intervals are too small. (A) Estimation of mutation rates with CpC-PoMol 0. Values on the Y axis represent 
mutation rates normalized by (/.Iac + Mag + Mat + Mca + Mcg)- I^AC stands for mutation rate from A to C, etc. h|iCT stands for CpC hypermutability. 
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sA between C and A, sG-sA between G and A, and sT-sA between T and A. 



Table 3. Modeling 


Variation among 


Exons. 




Model 


Number of 
Parameters 


AlC Score 


BIC Score 


Null' 


39 






Mut'' 


57 


-983.35 


-778.42 


Sel' 


42 


-242.59 


-208.44 


Mut-Sel'' 


60 


-982.14 


-743.06 


Mut (G = C)' 


55 


-988.19 


-806.04 


Mut-Sel (G = C)' 


56 


-987.16 


-793.62 



Note. — Comparison of models for variation in evolutionary patterns with respect to 
GC content. All exons were separated in six bins according to GC content. We 
estimated model parameters on the first and the last bins jointly. 
^asy-CpG-PoMolOb with no difference between bins. 
''Different mutation rates fi^^ for the two bins. 
'^Different selection coefficients s^. 

'^Both different mutation rates fi^:^ and selection coefficients s*. 
^Different mutation rates fi^^, and the constraints Sq — Sc ^rid Sa — Sj. 
^Different mutation rates and selection coefficients s^, and the constraints 
Sc = Sc and sa = st. 

We show AlC and BIC differences with respect to the Null model. The best BIC and 
AlC scores are underlined. 

and nonsense mutations (Gilis et al. 2001; De Maio et al. 2013) 
considered by (Lynch 2010; supplementary information and 
supplementary fig. SI, Supplementary Material online). Our 
analysis shows that estimates of relative mutation rates using 
phylogenetic data, as in Duret and Arndt (2008), are prefer- 
able, but for further comparative studies we suggest to use 
estimates from models accounting for fixation biases such as 
PoMo. 

The strongest fixation bias that we detected favors GC 
over AT. In mammals, this phenomenon is generally attrib- 
uted to gBGC. We inferred slightly lower estimates of gBGC 



than previous studies. Spencer et al. (2006) estimated the 
intensity of gBGC in the human genome from the allele fre- 
quency spectrum, while Lynch (2010) contrasted mutational 
patterns with nucleotide frequencies. The first approach con- 
siders the recent evolutionary past, while the second is infor- 
mative of the fixation bias in the long term, probably on the 
scale of hundreds of millions of years (Duret et al. 2002). Our 
approach is intermediate, as it considers polymorphisms and 
divergence, but not base composition. Lynch (2010) esti- 
mated a gBGC intensity of 4NeS ^ 0.99, which was within 
the range 0.5 < 4NeS < 1.3 from Spencer et al. (2006). We 
estimate 4NeS to be ^ 0.62. Although these values are not 
necessarily comparable, we recognized some additional 
causes for the small discrepancy. First, our method tends to 
slightly underestimate gBGC (fig. 2; supplementary fig. SI 2, 
Supplementary Material online). Second, we studied human- 
chimpanzee-orangutan data, and not only human. We ob- 
served a lower intensity of gBGC on the orangutan lineage 
(Ri 0.35 vs. ^ 0.7 of the human-chimp lineage; table 2). A 
lower fixation bias in favor of GC nucleotides causes a shift in 
substitution rates towards AT, and therefore a reduction in 
GC*. A reduction in GC* in orangutan was previously ob- 
served by Polak et al. (2010). We suggest that the most 
likely explanation for reduced gBGC fixation bias 4NeS in 
orangutan is a difference in historical effective population 
size (Ng), and not in the molecular repair bias itself (s). This 
is consistent with studies suggesting that Ng in orangutan was 
smaller than on human-chimp lineage: 65,000 ± 30,000 for 
the human-chimp ancestor and 45,000 ± 10,000 for the 
human-chimp-gorilla ancestor (Hobolth et al. 2007), in 
contrast to 26,800 ± 6,700 for the Bornean and Sumatran 
orangutan ancestor (Mailund et al. 2011). 
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Furthermore, we investigated variation of mutation and 
fixation biases along the exome with respect to present GC 
content. Regional variation in GC content is one of the most 
fascinating aspects of mammalian genomes, but its causes 
and consequences are still not well understood. Although 
some authors suggested that selection is the cause of GC 
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Fig. 5. Variation in equilibrium and ancestral GC content with base 
composition. Exon alignments were binned in 6 classes according to CC 
content. On Y axis, we show great apes root (yellow), observed (purple), 
and equilibrium (blue) CC content in each bin (on X axis are bins 
ordered by increasing CC content) using CpC-PoMolO. 



content variation origin and maintenance (Bernardi 2000), 
most studies proposed neutral explanations, such as variation 
in mutation rates (Fryxell and Zuckerkandl 2000; Fryxell and 
Moon 2005) or gBGC (Duret and Galtier 2009). Here, we 
jointly estimated variation in mutation and fixation biases, 
therefore accounting for possible confounding effects of one 
on the other. We conclude that mutation rates (and in par- 
ticular CpG hypermutability) and gBGC vary with base com- 
position (fig. 4). Nevertheless, GC content decreases, and over 
time becomes more homogeneous across the genome (fig. 5), 
as concluded by previous studies as well (Duret et al. 2002, 
2006; Meunier and Duret 2004). One of the possible explana- 
tions for the homogenization of GC content are changes in 
the recombination map, and therefore in gBGC intensity 
(Auton et al. 2012). Otherwise, a reduction in effective pop- 
ulation size may have led to a decrease in the intensity and 
variation of gBGC effects. Additional studies on different 
mammalian clades are necessary to determine the most 
likely scenarios. 

Finally, we measured differences in evolutionary patterns 
between exon centers and boundaries. Previous studies of 
base composition suggested that synonymous sites in differ- 
ent exonic positions are subject to different selective pressures 
due to splicing motifs (reviewed in Chamary et al. 2006). We 
confirmed these trends; in fact, we measured a fixation bias 
favoring A over T in boundaries of nonterminal exons, and T 
over A in exon centers (fig. 68). This observation cannot be 
explained with gBGC, and is consistent with expectations of 
the hypothesis of selection on exonic splicing enhancers, since 
those are A-rich and T-poor (Parmley and Hurst 2007). 
However, our findings are only marginally significant, and 
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Table 4. Modeling 


; Variation within 


Exons. 




Model 


Number of 


AlC Score 


BIC Score 




Par3iTieters 






Miill^ 
IMUII 


Dy 








57 


—12 39 


212 31 


Sel' 


42 


0.56 


38.01 


Mut-Sel'' 


60 


-20.54 


241.62 


Sel (G = cr 


40 


-4.69 


7.80 


Mut-Sel (C = C)' 


58 


-12.58 


224.60 



Note. — Comparison of models for variation in evolutionary patterns between exon 
center and exon boundaries. A boundary bin includes 5 sites from 5'- and 3'-ends of 
each exon. A second center bin includes all the remaining sites. Model parameters 
were estimated on both boundary and center bins jointly. 
^asy-CpG-PoMolOb with no difference between bins. 
''Different mutadon rates for the two bins. 
'^Different selection coefficients s^. 

'^Both different mutation rates and fitness coefficients s*. 
^Different Scc vs. Sat fitness. 

^Different mutation rates ji.^^ and Scc vs. Sat fitness. 

We show AlC and BIC differences with respect to the Null model. The best BIC and 
AlC scores are underlined. 

they might be improved by including more individuals from 
more species in the future. 

In conclusion, we presented a new phylogenetic model, 
PoMo, that can infer mutation and fixation biases from pat- 
terns of polymorphisms and divergence in populations/ 
species related by any arbitrary history. We provide the soft- 
ware to replicate our analyses that assumes a sample of 10 
sequences per species. This is a limitation of our implementa- 
tion, and not of the model. In fact, the number of haplotypes 
considered from each population as well as the number of sites 
do not represent a considerable computational burden for our 
methods. Furthermore, as PoMolO has fewer states than a 
codon model, it can be applied to phylogenies spanning sev- 
eral dozens of taxa (Seo and Kishino 2009; Gil et al. 2013). 

PoMo can also be applied to the estimation of phyloge- 
netic trees from population data. In fact, it has the potential 
to improve the resolution of short branches (relative to N^), 
where classical methods for phylogenetic inference often fail 
due to incomplete lineage sorting and shared ancestral poly- 
morphisms. These issues can already be accounted for by 
either making strong assumptions regarding recombination 
events (independent loci and no recombination within loci, 
see e.g., Heled and Drummond 2010) or by computationally 
demanding hidden Markov model approaches (Mailund et al. 
201 1). But, unlike PoMo, neither method is applicable to large 
numbers of individuals within-species, due to many possible 
coalescent trees. 

Until now there are only few clades with genome-wide 
population data available from multiple species, such as pri- 
mates and model organisms. This number is expected to 
grow in the near future, providing great opportunities to 
understand changes in evolutionary patterns. 

Materials and Methods 

Polymorph ism- Aware Phylogenetic Models 
Model Background 

Phylogenetic substitution models represent DNA evolution as 
a continuous-time Markov process along a phylogenetic tree 



T (for a review see Whelan et al. 2001). Different sites are 
generally assumed to evolve independently. Each point of i 
represents a taxon at an instant, and tree bifurcations corre- 
spond to speciation events. Another common assumption is 
homogeneity: the evolutionary process does not change 
through time and among species. Nucleotide substitution 
models associate the points of i to elements of the state 
space {A, C, G, T} with certain probabilities. If a point of i 
is in state C, it means that the corresponding taxon, at the 
corresponding time, had nucleotide C at the considered site 
of the genome. The phylogeny tips correspond to present, 
observable states. 

States assigned to taxa can change in time, and the con- 
tinuous-time Markov process modeling these changes is de- 
fined by an instantaneous rate matrix Q. Each entry Q)j of Q, 
with / ^ j, is the rate at which nucleotide / is replaced by 
nucleotide J. Given Q, for any branch b of length t in t it is 
possible to calculate the transition probability matrix 
P(t) = e*. Entry P,j(t) of P(t) is the probability that the 
end of branch b is in state J, conditioned on the start being 
in state /. Matrix P(t) is used to calculate the likelihood L{d) of 
any parameter values 9 via the Felsenstein pruning algorithm 
(Felsenstein 1981). The likelihood is the conditional probabil- 
ity of the data D given 9. Data D consist of DNA sequence 
alignments. Parameter estimates can be obtained by ML, that 
is, by determining the parameter values 9 that maximize the 
likelihood function. 

State Space 

Our new models are similar in most aspects to standard 
phylogenetic nucleotide substitution models described ear- 
lier. The most important difference is that we expand the 
state space. We do not only include four states associated 
to nucleotides, but also further states representing polymor- 
phisms. Nucleotide states in the new models represent sites 
with a fixed allele. Assuming at most two alleles per site per 
time per taxon, there exist six types of polymorphisms deter- 
mined by the alleles simultaneously present in a taxon: 
{A,C}, {A,G}, {A,T}, {C,G}, {C,T}, and {G,T}. We define 
PoMo N (POIymorphism-aware MOdel with virtual popula- 
tion size N) as a phylogenetic model with N - 1 polymorphic 
states for each type of polymorphism. PoMo N state space has 
therefore 4-l-6(N— 1) elements, which means that any 
taxon at any considered time point can be assigned to any 
of the 4 -I- 6(N - 1) states. The N - 1 states associated to the 
same polymorphism type represent different allele frequen- 
cies within a virtual population of N haploid individuals. For 
example, polymorphic state / (1 < i < N — 1) of type {A,C} 
represents a frequency of i/N for allele A and (N — ;)/N for 
allele C in a virtual population of N -l- 1 haploid individuals. 

Definition of Instantaneous Rates 

It is a common practice in population genetics to approxi- 
mate the dynamics of a large real population with those of a 
small virtual population (Keightley and Eyre-Walker 2007; 
Kaiser and Charlesworth 2009; Zeng and Charlesworth 
2009). To our knowledge, we present the first application of 
this approach to phylogenetics. We assume that the real pop- 
ulation has effective population size Ne, mutation rate per 
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generation from allele / to ], fitness parameter S/ for allele /, 
and number of generations t. We define our virtual popula- 
tion as evolving according to a Moran model (Moran 1958) 
with population size N, mutation rate per generation fiij from 
allele / to j, fitness parameter S/ for allele /, and number of 
generations t. The dynamics of a virtual population with 
properly scaled mutation rate (4Ne/I 4Nfi) and selecdon 
coefficient (2NeS 2Ns), are a good approximation of the 
real population, if time is scaled by the population size in both 
cases (t/N for the virtual population and t/Ne for the real one) 
and if N is sufficiently large (Zeng and Charlesworth 2009, 
observed that even with N as small as 10 reasonable results 
could be achieved). 

We make the assumption that the scaled mutation rate 
(4Ne/I) is low, and allow mutations only at monomorphic 
sites in our virtual population (VogI and Clemente 2012). 
Therefore, while our model is four-allelic, only two alleles 
can be present simultaneously in one population at a site. 
We represent the polymorphic state with / virtual individuals 

carrying allele /, and N - i carrying allele ],as(^^' . j ^. 

The probability that the virtual population in a polymor- 
phic state ( . [\ evolves to the state f . , ' t- ^ > I ) 
in a single virtual generation is 



M] 



N ^ ^ 



/(1+Sj-s,) + (N-/) 

Similarly, the probability to evolve from ( . , ' ■ ! ) '^o 
//-I I \ . \N-i J / 

(N + 2-i 

N - i 



i{^+Sj-S|) + {N-i) N' 



(2) 



The probability with which a new allele is introduced in the 
virtual populadon, that is, of evolving from monomorphic 

state / to polymorphic state ^ ^ ^ / ) 
generation is 



one virtual 



Nfiii. 



(3) 



Within a single generation no other changes are allowed, in 
fact, virtual allele counts can only increase or decrease by one 
per generadon. We call Mm the matrix of probabilides of allele 
frequency changes for one generation. Matrix A/l/^ has dimen- 
sion equal to the number of states, 4 -I- 6(N — 1). 

The last step in defining our model is transforming the 
Markov chain from discrete-time (in number of generations) 
into condnuous-time. A continuous-dme Markov chain is 
defined by its instantaneous rate matnx Q. We set the instan- 
taneous rate matnx of our continuous-dme process as 
Q^, := N(A/1n — I), where D is the identity matnx. Then, the 
probabilities of state changes in coalescent time t/N will be 
given by P(^) = e*^"", where t as before represents the 
number of virtual generations, but can now take noninteger 
values. We list the entnes of the rate matrix in supple- 
mentary table S9, Supplementary Matedal online. 



After fitting the matrix Q^, to real data by ML, we esti- 
mated the scaled fitness parameters in the real population 
(2NeS/) as 2Ns/. The four fitness parameters (sa, sc, sg, arid 
Sj) are defined up to an additive constant, and therefore 
correspond to three free parameters. When only gBGC is 
expected to drive fixation biases, we set Sa = sj and 
Sc = Sc, reducing the number of free parameters descnbing 
fitness differences to one. Likewise, we estimated the scaled 
mutadon rates (4Ne7Z(j) as 4N/li/j. 

Root Frequencies 

Stationarity and reversibility are common and mathemati- 
cally convenient assumptions for phylogenetic models, but 
are often not realistic (Galtier and Gouy 1995; Yang and 
Roberts 1995; Akashi et al. 2006; Gu and Li 2006). Here, we 
do not assume them. Because of nonstationarity of our 
model, state frequencies might change along i, and root 
state frequencies n might differ from the observed frequen- 
cies. To define the 4 -I- 6(N — 1) entnes of n, we use three 
additional free parameters (tta, ttc, and itq, where 
TTj := 1 — jta — TTc — TTc) representing relative frequencies 
of fixed nucleotides at the root. 

The root frequency of the polymorphic state with /' virtual 
individuals carrying allele /, and N - /' carrying allele J is: 



^tj = ^pol I I TTj/Xj/ T I + I TTi/ifj 



1 



N-i 



(4) 



where Knorm is a normalization factor, so that all root frequen- 
cies of polymorphic states sum up to 7Tpo\. Under neutrality 
and rare mutations, the expected proportion of polymorphic 
sites with derived allele count ; in a sample of N individuals is 
proportional to 1// (e.g., see eq. 4.20 in Wakeley 2009). A root 
polymorphism can be denved from both alleles present in the 
population, we therefore take into account both possibilides 
in equation (4). The proportion of polymorphic states at the 
root, TTpoi, is not a free parameter, but is set equal to the 
observed proportion of polymorphic states. In fact, jTpoi 
could not be reliably estimated via ML, and its value did 
not affect the estimation of other parameters nodceably (sup- 
plementary information and table SI, Supplementary 
Material online). The root frequency of a fixed state / is 

^lO - TTpol). 

PoMolO 

All results in this study are based on PoMolO (PoMo N with 
N = 10). PoMolO has 58 states: 4 fixed states and 54 polymor- 
phic states. In fact, for each of the 6 pairs of alleles 
({A,C}, {A,C}, {A,T}, {C,C}, {C,T}, and {C,T}), there are 9 
polymorphic states for the possible allele counts 
({9,1}, {8,2}, . . . ,{1,9}). Therefore, PoMolO has lower com- 
putational cost than a standard codon model (61 states), 
allowing genome-wide analysis of phylogenies with consider- 
able numbers of species. PoMolO approximates real popula- 
tion dynamics with those of a virtual population of 10 
individuals. Although this is a rough approximation, it is ex- 
pected to be sufficient for parameter estimation (Zeng and 
Chadesworth 2009, also confirmed by our simulations re- 
sults). Smaller values of N generally resulted in considerable 



2259 



DeMaioetal. ■ doi;10.1093/molbev/msc131 



MBE 



biases (supplementary fig. SI 6, Supplementary Material 
online). 

For each species and site considered, we randomly ex- 
tracted a sample of haploid size 10 (see Description of 
Data), and trivially associated the observed allele frequencies 
to the corresponding virtual frequency state, ignoring sam- 
pling variance. A limitation of PoMolO is that it requires 10 
sampled sequences for each species. A larger N is expected to 
result in improved estimates, but at computational costs 
(table 1). 

PoMolO Extensions: CpG Hypermutability and 
Strand-Asymmetry 

When we assume no context-dependency or strand-asymme- 
try, we define six free parameters to describe mutation biases, 
one for each unordered pair of nucleotides (/Xac. Mac- etc.). 
Yet, mutation rates in mammals show strong dependency on 
the neighboring bases (Hwang and Green 2004). We extended 
PoMolO to include the strongest context dependency, the 
hypermutability of CpG (nucleotide C followed by G) toward 
TpG or CpA. This is accounted for by an extra parameter 
hficT that describes the mutation rate from C to T and 
from G to A in a CpG context. We call this model CpG- 
PoMolO (for detailed description of rates see supplementary 
information. Supplementary Material online). To estimate pa- 
rameters of CpG-PoMolO, we only used synonymous sites 
whose preceding and following bases are constant among 
the considered species. We generally have three free param- 
eters describing nucleotide frequencies at the root, but with 
context dependency this number increases to 12 to account 
for different frequencies in different CpG contexts. 

We further extended CpG-PoMolO to account for hyper- 
mutability of transversions in CpG context (from CpG to 
ApG, GpG, CpC, and CpT). The resulting model, CpG- 
PoMolOb, has two additional free mutational parameters 
(for details see supplementary information. Supplementary 
Material online). 

Finally, we accounted for strand-specificity of mutation 
rates (Hwang and Green 2004; Polak and Arndt 2008). In 
the resulting model, asy-CpC-PoMolOb, the constraints for 
strand-symmetry (e.g., /Xag = Mtc) sre relaxed, and therefore 
9 extra free mutational parameters are necessary with respect 
to CpG-PoMolOb, for a total of 18 (for details see supplemen- 
tary information. Supplementary Material online). 

Model Implementation 

In this study, we assume that the tree topology is known and 
fixed (supplementary fig. SI 7, Supplementary Material 
online), and we estimate all branch lengths in the 4-species 
rooted tree. Lists of number of free parameters for different 
models are included in tables 2-4. 

Parameter estimation was performed via ML with the con- 
jugate gradient algorithm implemented in HyPhy (Pond et al. 
2005). For this purpose, we produced custom scripts in HyPhy 
Batch Language (supplementary file SI [Supplementary 
Material online] describes asy-CpG-PoMolOb and PoMolO 
for the case sq = sc and Sa = Sj). We always estimated all 
free parameters simultaneously. The scripts that we provide 
require 10 sequences sampled from each species. This is a 



limitation of the present state of our software, but not of 
the model (table 1 and discussion). We used different starting 
points for the conjugate gradient iterations on the whole real 
data set, and observed consistency of different optimization 
runs (supplementary information and supplementary fig. S2, 
Supplementary Material online). 

Description of Data 

Great Apes Data Set 

We constructed an exome-wide, inter- and intraspecies data 
set of alignments of 4-fold degenerate (synonymous) sites 
from H. sapiens, P. troglodytes, Pon. abelii, and Pon. pygmaeus 
(respectively human, chimpanzee, and Sumatran and 
Bornean orangutan). 

First, CCDS (Pruitt et al. 2009) alignments of H. sapiens, 
P. troglodytes, and Pon. abelii (references hg18, panTro2, and 
ponAbe2) were downloaded from the UCSC genome browser 
(http://genome.ucsc.edu, last accessed August 8, 2013). Only 
CCDS alignments satisfying the following requirements were 
retained for the subsequent analyses: divergence from human 
reference below 10%, no gene duplication in any species, start 
and stop codons conserved, no frame-shifting gaps, no gap 
longer than 30 bases, no nonsense codon, no gene shorter 
than 21 bases, no gene with different number of exons in 
different species, or genes in different chromosomes in differ- 
ent species (chromosomes 2a and 2b in nonhumans were 
identified with human chromosome 2). From the remaining 
CCDSs (9,695 genes and 79,677 exons), we extracted synon- 
ymous sites. We only considered third codon positions where 
the first two nucleotides of the same codon were conserved in 
the alignment, as well as the first position of the next codon. 

Then, population data were added to the species align- 
ments. Human single nucleotide polymorphisms (SNPs) from 
59 Yoruban (Nigerian) individuals (haploid sample size 
< 118) sequenced from the 1,000 genomes pilot project 
(1000 Genomes Project Consortium 2010) were downloaded 
(ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/, last accessed 
August 8, 2013) and included the alignments. Similarly, we 
added SNP data of 10 western chimpanzee individuals (hap- 
loid sample size <20) sequenced by the PanMap project 
(Auton et al. 2012) and downloaded from ftp://birch.well. 
ox.ac.uk/haplotypes/ (last accessed August 8, 2013). 
Orangutan SNP data for the two species considered, each 
with five sequenced individuals (haploid sample size < 10, 
Locke et al. 2011), were kindly provided to us by X. Ma (in 
preparation) and are now available online (http://www.ncbi. 
nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=contact& 
handle=WUCSC_SNP&batch_id= 1054968, last accessed 
August 8, 2013). We sub-sampled 10 alleles without replace- 
ment for each species and site. The final total number of 
synonymous sites included was 1,950,006 (for more details 
on real data sets see supplementary tables S4-S6, Supplemen- 
tary Material online). 

The collection of all synonymous site alignments in the 
great apes data set in valid format for PoMolO is provided as 
supplementary file S2, Supplementary Material online. 
Custom scripts to convert SNP data from VCF v4.0 format 
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into PoMolO states, and to convert multi-species alignments 
into HyPhy input files, are also provided as supplementary file 
S3, Supplementary Material online. 

Simulations 

We simulated a population of 50 diploid individuals evolving 
according to a phylogenetic tree (supplementary fig. SI 7, 
Supplementary Material online), where a branch bifurcation 
represents the duplication and split of a population. Evolution 
was simulated according to a Wright-Fisher model with 
sexual reproduction using simuPOP (Peng and Kimmel 
2005) and custom Python scripts. Phylogeny and mutation 
rates were set so to have similar divergence and diversity levels 
to those in real data (supplementary table S4, Supplementary 
Material online). 

First, we simulated five scenarios, in which we progressively 
added demographic events. The first scenario consisted of a 
constant-size population phylogeny. In the second scenario, 
we added a bottleneck on the human branch. In the third, we 
made a population expansion follow the bottleneck. In the 
fourth, we further added migration between the two orang- 
utan species. Finally, we reduced population size in the 
second half of the chimpanzee branch (for further details 
about simulated demographic events see supplementary in- 
formation. Supplementary Material online). 

In a second set of simulations, we included CpG context. 
We used root frequencies and mutation rates as estimated 
from GC-rich and GC-poor bins (the extreme bins in figs. 4 
and 5; for a detailed description of mutational parameters in 
simulations, see supplementary information. Supplementary 
Material online). For both the GC contents considered, we 
simulated three selective regimes with a GC versus AT fitness 
difference of 4NeS e {1,0, — 1}, for a total of six scenarios. For 
each scenario, we simulated 10^ independent sites, and then 
sub-sampled data sets of varying sizes (ranging from ^0'* to 
5 X 10^ sites), with 10 replicates for each size. 

Supplementary Material 

Supplementary files SI -S3, figures SI -SI 8, and tables S1-S9 
are available at Molecular Biology and Evolution online (http:// 
http:/ /m be.oxfordjournals.org/). 
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